How to represent a document?
What are vector space and bag-of-words models?
Features in text? And how to do text feature selection?
How to classify text data?
How to evaluate a classifier?
Supervised learning: Learning a function that maps an input to an output based on example input-output pairs.
infer a function from labeled training data
use the inferred function to label new instances
Human experts annotate a set of text data
Assigning subject categories, topics, or genres
Spam detection
Authorship identification
Age/gender identification
Language Identification
Sentiment analysis
…
Which problem is least likely to be a text classification task?
Author’s gender detection from text
Finding about the smoking conditions of patients from clinical letters
Grouping news articles into political vs non-political news
Classifying reviews into positive and negative sentiment
Represent by a string?
Represent by a list of sentences?
Represent by a vector?
A vector space is a collection of vectors
Represent documents by concept vectors
Each concept defines one dimension
k concepts define a high-dimensional space
Element of vector corresponds to concept weight
Distance between the vectors in this concept space
The process of converting text into numbers is called Vectorization
Terms are generic features that can be extracted from text
Typically, terms are single words, keywords, n-grams, or phrases
Documents are represented as vectors of terms
Each dimension (concept) corresponds to a separate term
\[d = (w_1, ..., w_n)\]
The Corpus represents a collection of documents (the dataset)
The Vocabulary is the set of all unique terms in the corpus
Bag of Words
Topics
Word Embeddings
With Bag of Words (BOW), we refer to a Vector Space Model where:
Terms: words (more generally we may use n-grams, etc.)
Weights: number of occurrences of the terms in the document
Term as the basis for vector space
Doc1: Text mining is to identify useful information.
Doc2: Useful information is mined from text.
Doc3: Apple is delicious.
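The three example documents above can be turned into bag-of-words vectors by hand. A minimal sketch (assuming lowercasing and punctuation removal as preprocessing):

```python
from collections import Counter

# Toy corpus from the slide, after lowercasing and dropping punctuation.
docs = [
    "text mining is to identify useful information",
    "useful information is mined from text",
    "apple is delicious",
]

# Vocabulary: the set of all unique terms in the corpus.
vocab = sorted({term for doc in docs for term in doc.split()})

def bow_vector(doc, vocab):
    # Each dimension is one term; the weight is its raw count in the document.
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

# Term-document matrix: one row per document, one column per vocabulary term.
matrix = [bow_vector(doc, vocab) for doc in docs]
```

Note that word order is discarded: Doc1 and Doc2 share several dimensions ("text", "useful", "information", "is") even though the sentences differ.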
Zipf’s law
Zipf’s law tells us: a term’s frequency is roughly inversely proportional to its rank in the frequency table, so a handful of terms are very frequent while most terms are rare.
Binary weighting: the weight is 1 if the term occurs in the document and 0 otherwise.
Idea: a term is more important if it occurs more frequently in a document
TF Formulas
Let \(c(t,d)\) be the frequency count of term \(t\) in doc \(d\)
Raw TF: \(tf(t,d) = c(t,d)\)
Let \(n_{d,t}\) denote the number of times the \(t\)-th term appears in the \(d\)-th document.
\[TF_{d,t} = \frac{n_{d,t}}{\sum_i{n_{d,i}}}\]
Let \(N\) denote the number of documents and \(N_t\) denote the number of documents containing the \(t\)-th term.
\[IDF_t = \log\left(\frac{N}{N_t}\right)\]
TF-IDF weight:
\[w_{d,t} = TF_{d,t} \cdot IDF_t\]
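The TF and IDF formulas above can be computed directly. A minimal sketch on hypothetical term counts:

```python
import math

# n_{d,t}: term counts for a toy corpus of N = 3 documents (hypothetical data).
doc_counts = [
    {"text": 2, "mining": 1},
    {"text": 1, "apple": 1},
    {"apple": 3},
]
N = len(doc_counts)

def tf(d, t):
    # TF_{d,t} = n_{d,t} / sum_i n_{d,i}  (count normalized by document length)
    counts = doc_counts[d]
    return counts.get(t, 0) / sum(counts.values())

def idf(t):
    # IDF_t = log(N / N_t), where N_t = number of documents containing t
    n_t = sum(1 for counts in doc_counts if t in counts)
    return math.log(N / n_t)

def tfidf(d, t):
    # w_{d,t} = TF_{d,t} * IDF_t
    return tf(d, t) * idf(t)
```

A term that occurs in every document gets IDF 0, so its TF-IDF weight vanishes no matter how frequent it is, which is exactly the behaviour Zipf's law motivates.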
Euclidean distance
\(dist(d_i, d_j) = \sqrt{\sum_{t\in V}{[tf(t,d_i)idf(t) - tf(t, d_j)idf(t)]^2}}\)
Longer documents will be penalized by the extra words
We care more about how these two vectors are overlapped
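Because Euclidean distance penalizes longer documents, overlap is usually measured with cosine similarity, which depends only on vector direction. A minimal sketch:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| |v|); length-invariant, so concatenating a
    # document with itself does not change its similarity to other documents.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1 = [1, 2, 0]
d2 = [2, 1, 1]
d1_doubled = [2, 4, 0]  # same document repeated twice: same direction
```

Under Euclidean distance, `d1_doubled` moves away from `d2`; under cosine similarity it does not.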
Feature selection is the process of selecting a specific subset of the terms of the training set and using only them in the classification algorithm.
high dimensionality of text features
Select the most informative features for model training
Reduce noise in feature representation
Improve final classification performance
Improve training/testing efficiency
Less time complexity
Fewer training data
Wrapper method
Filter method
Evaluate the features independently from the classifier and other features
No indication of a classifier’s performance on the selected features
No dependency among the features
Feasible for very large feature set
Let \(p(c | t)\) be the conditional probability that a document belongs to class \(c\), given the fact that it contains the term \(t\). Therefore, we have:
\[\sum^k_{c=1}{p(c | t)=1}\]
Then, the gini-index for the term \(t\), denoted by \(G(t)\) is defined as:
\[G(t) = \sum^k_{c=1}{p(c | t)^2}\]
The value of the gini-index lies in the range \([1/k, 1]\).
Higher values of the gini-index indicate a greater discriminative power of the term t.
If the global class distribution is skewed, the gini-index may not accurately reflect the discriminative power of the underlying attributes.
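The two extremes of \(G(t)\) are easy to check numerically. A minimal sketch, with hypothetical class-conditional probabilities:

```python
def gini_index(p_c_given_t):
    # G(t) = sum_c p(c|t)^2, where the p(c|t) sum to 1.
    assert abs(sum(p_c_given_t) - 1.0) < 1e-9
    return sum(p * p for p in p_c_given_t)

# A term spread evenly over k = 4 classes attains the minimum 1/k ...
uniform = gini_index([0.25, 0.25, 0.25, 0.25])
# ... while a term seen in only one class attains the maximum 1.
discriminative = gini_index([1.0, 0.0, 0.0, 0.0])
```

The uniform case gives 0.25 and the one-class case gives 1.0, matching the stated range \([1/k, 1]\).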
\({\chi}^2\)-Statistic
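The \({\chi}^2\)-statistic tests whether a term's presence is independent of a class. A minimal sketch for one term and one class, using the standard closed form for a 2×2 contingency table (the cell names are illustrative):

```python
def chi2_statistic(a, b, c, d):
    # 2x2 contingency table for term t and class c:
    #   a = docs in class containing t       b = docs in class without t
    #   c = docs outside class containing t  d = docs outside class without t
    n = a + b + c + d
    # chi^2 = n (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

A term distributed identically inside and outside the class scores 0; the larger the statistic, the stronger the term/class dependence, so terms are ranked by this value for selection.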
Input:
A training set of \(m\) manually-labeled documents \((d_1,c_1),\cdots,(d_m,c_m)\)
a fixed set of classes \(C = \{c_1, c_2, \ldots, c_J\}\)
Output:
Rules based on combinations of words or other features
Accuracy can be high: If rules carefully refined by expert
But building and maintaining these rules is expensive
Data/Domain specifics
Each class is represented by its centroid, and a test sample is assigned to the class with the nearest centroid. Using a training set of documents, the Rocchio algorithm builds a prototype vector (centroid) for each class: the average of the vectors of the training documents belonging to that class.
\[\boldsymbol{\mu_c} = \frac{1}{|D_c|}\sum_{\mathbf{d} \in D_c}{\mathbf{d}}\]
Where \(D_c\) is the set of documents in the corpus that belong to class \(c\) and \(\mathbf{d}\) is the vector representation of document \(d\).
The predicted label of document \(d\) is the class whose centroid has the smallest (Euclidean) distance to \(\mathbf{d}\).
\[\hat{c} = \arg \min_c ||\boldsymbol{\mu_c} - \mathbf{d}||\]
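The centroid formula and the argmin rule above can be sketched as follows, on hypothetical 2-D document vectors:

```python
import math

def centroid(vectors):
    # mu_c = (1/|D_c|) * sum of the document vectors in class c
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rocchio_predict(d, centroids):
    # Assign d to the class whose centroid is nearest in Euclidean distance.
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(centroids, key=lambda c: dist(centroids[c], d))

# Hypothetical 2-D training vectors for two classes.
centroids = {
    "sports": centroid([[1.0, 0.0], [0.8, 0.2]]),
    "politics": centroid([[0.0, 1.0], [0.2, 0.8]]),
}
```

scikit-learn's `NearestCentroid` implements the same idea with more options (e.g. shrinkage).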
Given a test document \(d\), the kNN algorithm finds the \(k\) nearest neighbors of \(d\) among all documents in the training set and scores the candidate categories based on the classes of those \(k\) neighbors. After sorting the scores, the algorithm assigns the document to the class with the highest score.
The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors such that nearer neighbors contribute more to the fit.
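Both variants, majority vote and distance-weighted vote, can be sketched in a few lines (training data below is hypothetical; the weight \(1/\mathrm{distance}\) is one common choice):

```python
import math
from collections import defaultdict

def knn_predict(query, train, k=3, weighted=True):
    # train: list of (vector, label) pairs. Score each class by its members
    # among the k nearest neighbours; with weighted=True, nearer neighbours
    # contribute more (weight = 1 / distance, with a tiny epsilon for ties
    # at distance zero).
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    scores = defaultdict(float)
    for vec, label in neighbours:
        d = dist(vec, query)
        scores[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(scores, key=scores.get)

train = [([0.0, 0.0], "neg"), ([0.1, 0.0], "neg"), ([1.0, 1.0], "pos"),
         ([0.9, 1.0], "pos"), ([1.0, 0.9], "pos")]
```

With `weighted=True`, two very close "neg" neighbours can outvote a more numerous but distant "pos" majority, which is the behaviour described above.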
Simple (“naïve”) classification method based on Bayes rule
Relies on very simple representation of document
\[P(c|d) = \frac{P(d|c)P(c)}{P(d)}\]
\[P(x_1, x_2, \ldots, x_n|c)\]
Bag of Words assumption: Assume position doesn’t matter
Conditional Independence: Assume the feature probabilities \(P(x_i|c_j)\) are independent given the class \(c\).
\[P(x_1, \ldots, x_n|c) = P(x_1 | c) \cdot P(x_2|c) \cdot P(x_3|c) \cdot \ldots \cdot P(x_n|c)\]
\[C_{MAP} = \underset{c \in C}{\operatorname{argmax}}P(x_1, x_2, \ldots,x_n|c)P(c)\]
\[C_{NB} = \underset{c_j \in C}{\operatorname{argmax}}P(c_j)\prod_{x \in X}{P(x|c_j)}\] \[C_{NB} = \underset{c_j \in C}{\operatorname{argmax}}P(c_j)\prod_{i \in positions}{P(x_i|c_j)}\]
First attempt: maximum likelihood estimates
\[\hat{P}(c_j) = \frac{doccount(C = c_j)}{N_{doc}}\] \[\hat{P}(w_i|c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V}count(w, c_j)}\]
fraction of times word \(w_i\) appears among all words in documents of topic \(c_j\)
Create mega-document for topic \(j\) by concatenating all docs in this topic
What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?
\[\hat{P}(\mathrm{"fantastic"|positive}) = \frac{count(\mathrm{"fantastic", positive})}{\sum_{w \in V}{count(\mathrm{w,positive})}}\] Zero probabilities cannot be conditioned away, no matter the other evidence!
\[C_{MAP} = \underset{c}{\operatorname{argmax}}\hat{P}(c)\prod_i{\hat{P}(x_i|c)}\]
\[ \begin{align} \hat{P}(w_i|c) &= \frac{count(w_i,c)+1}{\sum_{w \in V}{(count(w,c)+1)}} \\ &= \frac{count(w_i,c)+1}{(\sum_{w \in V}{count(w,c) + |V|})} \end{align} \]
From training corpus, extract Vocabulary
\(docs_j\leftarrow\) all docs with class = \(c_j\)
\[P(c_j) \leftarrow \frac{|docs_j|}{N_{doc}}\]
Calculate \(P(w_k|c_j)\) terms
\(Text_j \leftarrow\) single doc containing all \(docs_j\)
For each word \(w_k\) in Vocabulary
\(n_k \leftarrow\) # of occurrences of \(w_k\) in \(Text_j\)
Add one extra word to the vocabulary, the “unknown word” \(w_u\)
\[ \begin{align} \hat{P}(w_u|c) &= \frac{count(w_u,c)+1}{(\sum_{w \in V}{count(w,c)})+|V|+1} \\ &= \frac{1}{(\sum_{w \in V}{count(w,c)}) + |V|+1} \end{align} \]
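The training procedure above (mega-document per class, add-one smoothing) can be sketched as follows; the tiny corpus is hypothetical, and unseen test words are simply skipped rather than mapped to \(w_u\):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (list-of-words, class). Returns log P(c) and the
    # add-one-smoothed log-likelihoods log P(w|c).
    vocab = {w for words, _ in docs for w in words}
    n_docs = len(docs)
    class_docs = defaultdict(list)
    for words, c in docs:
        class_docs[c].extend(words)          # mega-document for class c
    log_prior, log_lik = {}, {}
    for c, text in class_docs.items():
        log_prior[c] = math.log(sum(1 for _, cj in docs if cj == c) / n_docs)
        counts = Counter(text)
        denom = len(text) + len(vocab)       # sum_w count(w,c) + |V|
        log_lik[c] = {w: math.log((counts[w] + 1) / denom) for w in vocab}
    return log_prior, log_lik

def predict_nb(words, log_prior, log_lik):
    # c_NB = argmax_c [ log P(c) + sum_i log P(x_i|c) ]; log-space avoids
    # floating-point underflow from multiplying many small probabilities.
    def score(c):
        return log_prior[c] + sum(log_lik[c][w] for w in words if w in log_lik[c])
    return max(log_prior, key=score)
```

Because of smoothing, a word never seen with a class still gets a small nonzero probability, so one unseen word cannot zero out the whole product.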
With the kernel trick, SVM can construct a nonlinear decision surface in the original feature space by mapping the data instances non-linearly to a new space where the classes can be separated linearly with a hyperplane.
SVM is quite robust to high dimensionality.
A decision tree is a hierarchical decomposition of the (training) data space, in which a condition on the feature value is used in order to divide the data space hierarchically.
Training set
Test set
Validation set (dev set)
https://scikit-learn.org/stable/modules/cross_validation.html
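The k-fold cross-validation behind that link can be sketched without any library: partition the data into \(k\) folds and let each fold serve once as the test set. A minimal sketch (scikit-learn's `KFold` adds shuffling and stratified variants):

```python
def k_fold_splits(n_samples, k):
    # Partition indices 0..n_samples-1 into k contiguous folds; each fold
    # serves once as the test set while the remaining folds form the
    # training set.
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_samples
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test
```

Averaging the metric over the \(k\) test folds gives a less variance-prone estimate than a single train/test split, at the cost of training \(k\) models.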
Accuracy is a valid choice of evaluation for classification problems which are well balanced and not skewed.
Let us say that our target class is very sparse. Do we want accuracy as a metric of our model performance? What if we are predicting if an asteroid will hit the earth? Just say “No” all the time. And you will be 99% accurate. The model can be reasonably accurate, but not at all valuable.
Precision: % of selected items that are correct
Recall: % of correct items that are selected
Precision is a valid choice of evaluation metric when we want to be very sure of our prediction.
Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.
A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):
\[F = \frac{1}{\alpha\frac{1}{P}+(1-\alpha)\frac{1}{R}}=\frac{(\beta^2+1)PR}{\beta^2P + R}\]
The harmonic mean is a very conservative average; see IIR § 8.3
People usually use balanced F1 measure - i.e., with \(\beta = 1\) (that is, \(\alpha = 1/2\)): \(F = 2PR/(P+R)\)
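The precision, recall, and balanced F1 definitions above reduce to a few lines given the confusion-matrix counts:

```python
def precision_recall_f1(tp, fp, fn):
    # Precision = tp/(tp+fp): fraction of selected items that are correct.
    # Recall    = tp/(tp+fn): fraction of correct items that are selected.
    # Balanced F1 (beta = 1): harmonic mean 2PR/(P+R).
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```

For example, 8 true positives, 2 false positives, and 8 false negatives give P = 0.8 and R = 0.5; the harmonic mean pulls F1 down toward the weaker of the two, unlike an arithmetic mean.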
AUC provides an aggregate measure of performance across all possible classification thresholds.
One way of interpreting AUC is as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’). AUC measures how well predictions are ranked, rather than their absolute values.
AUC ranges from 0 (100% wrong) through 0.5 (a random classifier) to 1 (100% correct). For example, if you as a marketer want a list of users who will respond to a marketing campaign, AUC is a good metric: ranking users by predicted probability gives exactly the order in which you would contact them.
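The rank interpretation above can be computed directly: compare every positive score against every negative score and count wins (ties count half). A minimal sketch on hypothetical scores; real implementations such as scikit-learn's `roc_auc_score` use a sort instead of the quadratic double loop:

```python
def auc(scores_pos, scores_neg):
    # AUC = probability that a randomly chosen positive instance is ranked
    # above a randomly chosen negative one; ties contribute 0.5.
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Note that only the ordering of the scores matters: multiplying every score by 10 leaves the AUC unchanged, which is why it measures ranking quality rather than calibrated probabilities.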
Gee, I’m building a text classifier for real, now!
What should I do?
If (wheat or grain) and not (whole or bread) then Categorize as grain
Need careful crafting
Human tuning on development data
Time-consuming: 2 days per class
Use Naïve Bayes
Get more labeled data
Try semi-supervised training methods:
Perfect for all the clever classifiers
SVM
Regularized Logistic Regression
You can even use user-interpretable decision trees
Users like to hack
Management likes quick fixes
Can achieve high accuracy!
At a cost:
SVMs (train time) or kNN (test time) can be too slow
Regularized logistic regression can be somewhat better
So Naïve Bayes can come back into its own again!
With enough data
Domain-specific features and weights: very important in real performance
Sometimes need to collapse terms:
Part numbers, chemical formulas, …
But stemming generally doesn’t help
title words
first sentence of each paragraph (Murata, 1999)
In sentences that contain title words
Corpus: a large and structured set of texts
Stop words: words which are filtered out before or after processing of natural language data (text)
Unstructured text: information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
Tokenizing: process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens (see also lexical analysis)
Natural language processing: field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages
Term document (or document term) matrix: a mathematical matrix that describes the frequency of terms that occur in a collection of documents
Supervised learning: the machine learning task of inferring a function from labeled training data
Unsupervised learning: find hidden structure in unlabeled data
Vector space model & BOW
Feature Selection
Text Classification
Evaluation